12 research outputs found
GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer
Named Entity Recognition (NER) is essential in various Natural Language
Processing (NLP) applications. Traditional NER models are effective but limited
to a set of predefined entity types. In contrast, Large Language Models (LLMs)
can extract arbitrary entities through natural language instructions, offering
greater flexibility. However, their size and cost, particularly for those
accessed via APIs like ChatGPT, make them impractical in resource-limited
scenarios. In this paper, we introduce a compact NER model trained to identify
any type of entity. Leveraging a bidirectional transformer encoder, our model,
GLiNER, facilitates parallel entity extraction, an advantage over the slow
sequential token generation of LLMs. Through comprehensive testing, GLiNER
demonstrates strong performance, outperforming both ChatGPT and fine-tuned LLMs
in zero-shot evaluations on various NER benchmarks. Comment: Work in progress
DyREx: Dynamic Query Representation for Extractive Question Answering
Extractive question answering (ExQA) is an essential task for Natural
Language Processing. The dominant approach to ExQA is one that represents the
input sequence tokens (question and passage) with a pre-trained transformer,
then uses two learned query vectors to compute distributions over the start and
end answer span positions. These query vectors lack the context of the inputs,
which can be a bottleneck for the model performance. To address this problem,
we propose DyREx, a generalization of the vanilla approach
where we dynamically compute query vectors given the input, using an attention
mechanism through transformer layers. Empirical observations demonstrate that
our approach consistently improves the performance over the standard one. The
code and accompanying files for running the experiments are available at
https://github.com/urchade/DyReX. Comment: Accepted at "2nd Workshop on Efficient Natural Language and Speech
Processing (ENLSP-II)" @ NeurIPS 202
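The standard (vanilla) span-extraction head described in the abstract above can be illustrated with a minimal sketch. It assumes the token representations are already computed by some encoder, and it uses two fixed query vectors, which is the baseline that DyREx generalizes, not DyREx itself. The names `span_distributions`, `best_span`, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def span_distributions(token_reprs, q_start, q_end):
    """Score every token against the start and end query vectors.

    token_reprs: one vector per input token (assumed to come from an encoder).
    q_start, q_end: the two learned query vectors; here they do not depend on
    the input, which is the limitation the abstract points out.
    """
    start_probs = softmax([dot(h, q_start) for h in token_reprs])
    end_probs = softmax([dot(h, q_end) for h in token_reprs])
    return start_probs, end_probs


def best_span(start_probs, end_probs):
    """Return the (start, end) pair with start <= end and highest joint probability."""
    best, best_p = (0, 0), -1.0
    for i, p_start in enumerate(start_probs):
        for j in range(i, len(end_probs)):
            p = p_start * end_probs[j]
            if p > best_p:
                best, best_p = (i, j), p
    return best


# Toy input: four tokens with two-dimensional representations (made-up values).
token_reprs = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 3.0]]
q_start = [1.0, 0.0]
q_end = [0.0, 1.0]

start_probs, end_probs = span_distributions(token_reprs, q_start, q_end)
span = best_span(start_probs, end_probs)
print(span)
```

In a dynamic variant, `q_start` and `q_end` would be computed from the input rather than stored as fixed parameters; how DyREx does this is described in the paper and is not reproduced here.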
Sequence Classification Based on Delta-Free Sequential Pattern
Sequential pattern mining is one of the most studied and challenging tasks in data mining. However, extending well-known methods from other classical pattern types to sequences is not a trivial task. In this paper we study the notion of δ-freeness for sequences. While this notion has been discussed extensively for itemsets, this work is the first to extend it to sequences. We define an efficient algorithm devoted to the extraction of δ-free sequential patterns. Furthermore, we show the advantage of δ-free sequences and highlight their importance when building sequence classifiers, and we show how they can be used to address the feature selection problem in statistical classifiers, as well as to build symbolic classifiers which optimize both accuracy and earliness of predictions.
Fouille de motifs et modélisation statistique pour l'extraction de connaissances textuelles
In natural language processing, two main approaches are used: machine learning and data mining. In this context, cross-referencing data mining methods based on patterns and statistical machine learning methods is a promising but hardly explored avenue. In this thesis, we present three major contributions: the introduction of delta-free patterns, used as statistical model features; the introduction of a semantic similarity constraint for the mining, calculated using a statistical model; and the introduction of sequential labeling rules, created from the patterns and selected by a statistical model.
Pattern mining and machine learning for extracting textual information
In natural language processing, two main approaches are used: machine learning and data mining. In this context, cross-referencing data mining methods based on patterns and statistical machine learning methods is a promising but hardly explored avenue. In this thesis, we present three major contributions: the introduction of delta-free patterns, used as statistical model features; the introduction of a semantic similarity constraint for the mining, calculated using a statistical model; and the introduction of sequential labeling rules, created from the patterns and selected by a statistical model.
Classification de texte enrichie à l'aide de motifs séquentiels
Sequential pattern mining for text classification. Most methods in text classification rely on contiguous sequences of words as features. Indeed, if we want to take non-contiguous (gappy) patterns into account, the number of features increases exponentially with the size of the text. Furthermore, most of these patterns will be mere noise. To overcome both issues, sequential pattern mining can be used to efficiently extract a smaller number of relevant, non-contiguous features. In this paper, we compare the use of constrained frequent pattern mining and δ-free patterns as features for text classification. We show experimentally the advantages and disadvantages of each type of pattern.
High Dimensional Data Stream Clustering using Topological Representation Learning
Due to the high dimensionality of the data, storing the whole data set during stream processing is impractical. Therefore, only a summary of the input stream is maintained, which requires specialized data structures that permit incremental summarization of the input stream. The problem becomes more complex when dealing with high-dimensional text data because of its high sparsity. In this paper we propose a new topological unsupervised learning approach for high-dimensional text data streams. The proposed method simultaneously learns the representation of the stream and clusters the data in a lower-dimensional space. The proposed OTTC (Online Topological Text Clustering) approach is evaluated and compared with state-of-the-art methods using MOA (Massive Online Analysis), an open-source benchmarking framework for evolving data streams. The proposed approach outperforms the classical methods, and the results obtained are very promising for clustering high-dimensional text data streams.
Sélection globale de segments pour la reconnaissance d'entités nommées
Named Entity Recognition is an important task in Natural Language Processing with applications in many domains. In this paper, we describe a novel approach to named entity recognition, in which we output a set of spans (i.e., segmentations) by maximizing a global score. During training, we optimize our model by maximizing the probability of the gold segmentation. During inference, we use dynamic programming to select the best segmentation with linear time complexity. We prove that our approach outperforms CRF and semi-CRF models for Named Entity Recognition.
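The inference step mentioned in the abstract above, selecting the best set of non-overlapping spans by dynamic programming, can be sketched as follows. This is a generic illustration and not the authors' exact formulation: it assumes span scores are already given, that tokens left outside any span contribute a score of 0, and that span width is bounded by a constant `max_width`, which is what keeps the run time linear in the sentence length. The function name and the toy scores are hypothetical.

```python
def best_segmentation(n, span_scores, max_width):
    """Pick non-overlapping spans with the highest total score.

    n: number of tokens in the sentence.
    span_scores: dict mapping half-open spans (i, j), covering tokens i..j-1,
        to a score. Tokens not covered by a chosen span score 0 (assumption).
    max_width: maximum span length considered; with a fixed max_width the
        loop below does O(n * max_width) work, i.e. linear in n.
    """
    best = [0.0] * (n + 1)   # best[j]: best total score over the first j tokens
    back = [None] * (n + 1)  # back[j]: start of the span ending at j, or None

    for j in range(1, n + 1):
        # Option 1: leave token j-1 outside any span.
        best[j] = best[j - 1]
        back[j] = None
        # Option 2: end a span (i, j) at position j.
        for i in range(max(0, j - max_width), j):
            if (i, j) in span_scores:
                candidate = best[i] + span_scores[(i, j)]
                if candidate > best[j]:
                    best[j] = candidate
                    back[j] = i

    # Follow the back-pointers from the end to recover the chosen spans.
    spans = []
    j = n
    while j > 0:
        i = back[j]
        if i is None:
            j -= 1
        else:
            spans.append((i, j))
            j = i
    spans.reverse()
    return best[n], spans


# Toy example: five tokens and a few made-up span scores.
scores = {(0, 2): 2.0, (1, 3): 3.0, (3, 5): 1.5, (4, 5): 1.0}
total, spans = best_segmentation(5, scores, max_width=2)
print(total, spans)
```

In this toy input, spans (0, 2) and (1, 3) overlap, so the program has to choose between them; it keeps (1, 3) because combining it with (3, 5) gives the highest total.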
Weakly-supervised Symptom Recognition for Rare Diseases in Biomedical Text
In this paper, we tackle the issue of symptom recognition for rare diseases in biomedical texts. Symptoms typically have a more complex and ambiguous structure than other biomedical named entities. Furthermore, existing resources are scarce and incomplete. Therefore, we propose a weakly-supervised framework based on a combination of two approaches: sequential pattern mining under constraints and sequence labeling. We use unannotated biomedical paper abstracts with dictionaries of rare diseases and symptoms to create our training data. Our experiments show that both approaches outperform simple projection of the dictionaries on text, and that their combination is beneficial. We also introduce a novel pattern mining constraint based on semantic similarity between words inside patterns.